Importing main libraries

Load the dataset & understand the problem

Loading and description

Examine one sample

Sample label

Features of the sample

Encoding issues

The labels are encoded as strings. Let's fix this quickly by casting them with the Series.astype() method.
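A minimal sketch of the cast, using toy string labels (fetch_openml, for instance, returns labels as strings):

```python
import pandas as pd

# hypothetical label Series, as loaded from e.g. fetch_openml (string labels)
y = pd.Series(['5', '0', '4', '1'])

# cast the string labels to small unsigned integers
y = y.astype('uint8')
```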

Split the dataset

Since the dataset is already shuffled and split, we can simply assign subsets to our train and test variables.
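A sketch of that slicing on a toy stand-in array (with MNIST the split index would be 60000, i.e. 60k train / 10k test):

```python
import numpy as np

# toy stand-in for the already-shuffled dataset
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

split = 8  # hypothetical split index; 60000 for MNIST
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```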

Training a binary classifier

Let's simplify the problem by considering an algorithm that only discriminates fives. The output will be:
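Building that binary target is a one-line boolean mask; a sketch with toy labels:

```python
import numpy as np

y_train = np.array([5, 0, 4, 5, 9])  # toy digit labels
y_train_5 = (y_train == 5)           # True for fives, False for everything else
```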

Here we make use of the brilliant cross_validate() function, which accepts a scoring dictionary of the form {'name': 'metric'}.
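A runnable sketch on synthetic data (the classifier and dataset here are placeholders, not the ones from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate

# toy data standing in for the real training set
X, y = make_classification(n_samples=200, random_state=42)

scoring = dict(accuracy='accuracy', precision='precision', recall='recall')
res = cross_validate(SGDClassifier(random_state=42), X, y, cv=3, scoring=scoring)
# res holds one array per metric, keyed 'test_<name>', with one score per fold
```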

note\ I let you check the recall, which suggests that the cv parameter of cross_validate() doesn't trigger a stratified k-fold by default. Weird, since I think I read several times that it triggers a stratified k-fold for classification and a standard k-fold for regression...

The Accuracy issue with unbalanced labels

What would be the accuracy of a classifier that only outputs False?
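With ~10% positives (roughly the share of fives in MNIST), an always-False "classifier" already scores ~90% accuracy; a sketch on toy labels:

```python
import numpy as np

# toy unbalanced labels: 10% positives, like the "is it a 5?" task
y = np.array([False] * 90 + [True] * 10)

always_false = np.zeros(100, dtype=bool)  # "classifier" that always predicts False
accuracy = (always_false == y).mean()     # high accuracy despite predicting nothing useful
```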


Baseline metrics

Confusion matrix

Understand the scheme

Each row is an actual class; each column is a predicted class.

TN | FP
FN | TP
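sklearn's confusion_matrix follows exactly that layout; a sketch on toy predictions:

```python
from sklearn.metrics import confusion_matrix

y_true = [False, False, False, True, True, True]
y_pred = [False, True, False, True, True, False]

cm = confusion_matrix(y_true, y_pred)
# rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
```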

The derived metrics

recall: true positives over all actual positives -> TPR (True Positive Rate)\ precision: true positives over all predicted positives

The precision score divides the number of TP by the number of all predicted positives:\ precision = TP/(TP + FP)

But if the classifier outputs one single positive prediction which happens to be correct, and predicts Negative everywhere else, then precision = 1/(1+0) = 1. So precision must be counter-balanced with the recall score:\ recall = TP/(TP + FN)
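The single-correct-positive scenario above can be checked directly with sklearn's metric functions; a sketch on toy labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one single positive prediction, and it is correct

p = precision_score(y_true, y_pred)  # TP=1, FP=0 -> 1/(1+0) = 1.0
r = recall_score(y_true, y_pred)     # TP=1, FN=2 -> 1/(1+2) = 0.33...
```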

Aggregate metrics

f1: the harmonic mean of precision and recall, which gives more weight to low values. This is a convenient way of comparing classifiers, but keep in mind that while f1 favors classifiers with high precision and high recall, it also favors classifiers having approximately the same score on both metrics, and that is not always what we want.

Example A

We want to reduce the number of false positives, so we want to maximize TP over all predicted positives: high precision.\ -> note that it will probably come at the cost of several good videos not being selected (FN), hence a low recall.

Example B

We want every ill person (positive) to be detected, even if it means many healthy people will be wrongly declared ill (FP). So we want FN to be as low as possible: hence recall must be as high as possible.

Let's code all of this

We need the predictions themselves to compute the various metrics more precisely than with cross_val_score() or cross_validate(). We get them with cross_val_predict() and then feed the predicted array to the built-in metrics of sklearn. We get the predictions on the train set, of course!
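A runnable sketch of that workflow on synthetic data (classifier and dataset are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

X_train, y_train = make_classification(n_samples=200, random_state=42)

# out-of-fold predictions on the train set: each instance is predicted
# by a model that never saw it during fitting
y_pred = cross_val_predict(SGDClassifier(random_state=42), X_train, y_train, cv=3)

p = precision_score(y_train, y_pred)
r = recall_score(y_train, y_pred)
f1 = f1_score(y_train, y_pred)
```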

Custom func (because it's always better to work with proper tools)

Classification report (not so useful for a binary classifier)

Recall / Precision trade-off

By trade-off I mean the way the threshold used by the model's decision function influences the output of the prediction. To see this influence, it's useful to plot the recall and the precision against the various threshold values.

The function sklearn.metrics.precision_recall_curve returns the arrays of precisions, recalls and thresholds (in that order).
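A sketch with toy scores, showing the shapes returned:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # e.g. decision_function outputs

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
# thresholds is one element shorter: the final (precision=1, recall=0)
# point has no associated threshold
```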

Note

General method:

y_scores = cross_val_predict(model, X_train, y_train, cv=folds,
                             method='decision_function')

But for XGBoost and some other algorithms, we need to get y_scores with the method xgb_model.predict_proba(X)[:,1]

y_scores = xgbmodel.predict_proba(X)[:,1]

# of course it's waaaaay better with a cross validation workflow

y_scores = cross_val_predict(xgbmodel, X_train, y_train, cv=folds, 
                             method='predict_proba')[:,1]

Custom function to see the trade-off

Understanding the threshold

We can manipulate the threshold used by the model. In the following examples we see how to get the best recall for a precision of 0.9, then the best precision for a recall of 0.78.

y_scores is an array containing the scores returned by the decision_function method:

array([  1200.93051237, -26883.79202424, -33072.03475406, ...,
        13272.12718981,  -7258.47203373, -16877.50840447])

With a threshold at n, all values below n will be False (negatives) and all values greater than or equal to n will be True (positives).

Customizing the preds with a threshold

  1. from a given precision

# np.argmax returns the first index of the array satisfying the condition
np.argmax(precision >= 0.90)

# threshold value for precision >= n
thresholds[np.argmax(precision >= 0.90)]

# mask of the y_scores array according to this threshold
(y_scores >= thresholds[np.argmax(precision >= 0.90)])

# condensed formula
cust_preds = (y_scores >= thresholds[np.argmax(precision >= 0.90)])

  2. from a given recall

The comparison sign changes because the recall curve is decreasing.

# condensed formula
cust_preds = (y_scores >= thresholds[np.argmax(recall <= 0.78)])

Examples

Custom func to get the preds with an adjusted threshold

This function returns the array of predictions and prints the two metrics by default.

Precision/recall & ROC curves

Precision/recall curve

We just need to get the precisions, recalls and thresholds (unused here), as for the precision-recall trade-off plot, and plot precision against recall. This curve is a straightforward representation of what it means to artificially increase the recall by moving the threshold.
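A minimal plotting sketch (toy scores, headless matplotlib backend so it runs in a script):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
precisions, recalls, _ = precision_recall_curve(y_true, y_scores)

fig, ax = plt.subplots()
ax.plot(recalls, precisions)  # precision as a function of recall
ax.set(xlabel='Recall', ylabel='Precision', title='Precision/Recall curve')
```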

It's just another way of seeing this relationship between the two metrics.

When to use it over the PR curve?

ROC curve

ROC (receiver operating characteristic) is also meant to be used with binary classifiers only. It plots the TPR (True Positive Rate, aka recall) against the FPR (False Positive Rate), which is FP/len(negative class).

Note:\ FPR = FP/(FP + TN)\ TNR = TN/(FP + TN) # aka specificity\ so FPR = 1 - TNR

It's important to notice that FPR = 1 - specificity and should tend to 0 (specificity -> 1). This is another view of the trade-off: the higher the TPR, the lower the TNR (i.e. the higher the FPR).
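sklearn's roc_curve returns exactly these quantities; a sketch with toy scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# fpr = FP/(FP + TN) = 1 - TNR, tpr = recall, one point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
```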

Visualize the curves

Binary classifier full metrics dashboard

Baseline metrics + curves: hover effects are pimped, you can try!

Comparing classifiers

It's very straightforward: just compare scores. The tricky part is the choice of the scoring metric. Let's import and train a Random Forest classifier.

A visual assessment of the curves will be double-checked by measuring the area under the curve. Import this metric as well.

Hiyaa, we can finally grab our results, plot everything, close the laptop and play football with the kids... ah, but it's not as simple as in the last chapter. Like XGBoost and other ensemble algorithms, Random Forest does not implement a decision_function() method but a predict_proba() instead.

This predict_proba returns an array of dimension (m, n) such that:

This is where it becomes interesting, since the n columns contain the probability that a given instance belongs to each class. So if we have an array like this:

we can now simply take the column of positive predictions, which is in fact the score we're aiming for:

all_probas = cross_val_predict(rndfor_model, X, y, method='predict_proba')
y_scores = all_probas[:,1]

# or more simply, but also less elegant; of course the model needs to be fitted:
y_scores = rndfor_model.predict_proba(X)[:,1]

Train a Random Forest classifier

Comparing the curves

Since we're in the process of short-listing a classifier, we're using a function that only compares:

It's more convenient to limit the comparison to two models, because a plot with more than two areas represented is not very readable. Anyway, let's plot! But before that, we can structure our work a little. All the arrays of probas or predictions, all the metric variables, etc.: that's a lot to remember. We can make use of a dataclass object to store all of this efficiently, and then access what we want through the attributes of each model.

Make use of dataclasses to store our models

Here is the whole procedure:

Since we only need the decision_function array (or the predict_proba one, depending on the model), we get both arrays with a cross_val_predict and the right method argument passed,

then we create the object storing all we need.

Here is a quick function that generates our various metrics and populates a dataclass object. Nothing fancy here; the template is the following:

@dataclass(frozen=True, order=True, eq=True, unsafe_hash=True)
class ClassifierModel:
    name: str = field(default='Base_name')
    model: ClassifierMixin = field(default=None, compare=False)
    y: np.ndarray = field(default=None, compare=False, repr=False)
    scores: np.ndarray = field(default=None, compare=False, repr=False)
    fp_rates: np.ndarray = field(default=None, compare=False, repr=False)
    tp_rates: np.ndarray = field(default=None, compare=False, repr=False)
    thresholds: np.ndarray = field(default=None, compare=False, repr=False)
    auc: float = field(default=0.0, compare=True)
    accuracy: float = field(default=0.0, compare=False)
    precision: float = field(default=0.0, compare=False)
    recall: float = field(default=0.0, compare=False)
    f1: float = field(default=0.0, compare=False)
    desc: str = field(default="Custom object to store a model and its basic results", repr=True)

The resulting object looks like this:

We can now create the object for the promising model:

Comparing results

We can finally compare our models:

Of course this is not always a practical way to compare two binary classifiers, but it can be handy to be able to display a clear, visual summary of the differences between two models. In the end, the whole process boils down to these lines of code:

# pipelining
from sklearn.pipeline import Pipeline

# data processing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# model evaluation
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate, StratifiedKFold, cross_val_predict

# classifier
from sklearn.neighbors import KNeighborsClassifier


# model structure and classifier implement
base_clf = Pipeline([
    ('pca', PCA(n_components=0.5, iterated_power=7)),
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_jobs=-1))   
])

# useful variables
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
scoring = dict(recall='recall', precision='precision', f1='f1', accuracy='accuracy')

# auc
probas = cross_val_predict(base_clf, X_train, y_train, cv=folds, n_jobs=-1, method='predict_proba')[:,1]
auc = roc_auc_score(y_train, probas)

# baseline metrics
base_clf_scores = cross_validate(base_clf, X_train, y_train, scoring=scoring, cv=folds, n_jobs=-1)

# consolidate model
base_clf_scores.update(dict(model=base_clf, auc=auc))